Masked image modeling (MIM) has shown huge potential for self-supervised learning over the past year. Built on the universal vision transformer backbone, MIM learns self-supervised visual representations by masking a portion of the image patches while attempting to recover the missing pixels. Most previous works mask image patches randomly, which underutilizes semantic information that is beneficial to visual representation learning. On the other hand, due to the large size of the backbone, most previous works have to spend a great deal of time on pre-training. In this paper, we propose the \textbf{Attention-driven Masking and Throwing Strategy} (AMT), which addresses both problems. We first leverage the self-attention mechanism to obtain the semantic information of the image automatically during training, without any supervision. The masking strategy is then guided by this information to mask regions selectively, which is helpful for representation learning. Moreover, a redundant patch throwing strategy is proposed, which makes learning more efficient. As a plug-and-play module for masked image modeling, AMT improves the linear probing accuracy of MAE by $2.9\% \sim 5.9\%$ on CIFAR-10/100, STL-10, Tiny ImageNet, and ImageNet-1K, and also improves the fine-tuning accuracy of MAE and SimMIM. Moreover, this design achieves superior performance on downstream detection and segmentation tasks.
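A minimal sketch of how attention-guided masking and throwing could look in practice, assuming a ViT encoder that exposes [CLS]-to-patch attention; the ranking rule, mask ratio, and throw ratio below are illustrative placeholders, not the exact AMT procedure.

```python
import torch

def attention_guided_mask(cls_attn: torch.Tensor, mask_ratio=0.6, throw_ratio=0.1):
    """Select patches to mask/throw from [CLS]-to-patch attention.

    cls_attn: (B, N) attention weights of the [CLS] token over N patches,
              averaged over heads (assumed to be exposed by the ViT encoder).
    Returns boolean masks of shape (B, N) for masked and thrown patches.
    """
    B, N = cls_attn.shape
    order = cls_attn.argsort(dim=1, descending=True)   # most-attended patches first
    n_mask, n_throw = int(N * mask_ratio), int(N * throw_ratio)

    masked = torch.zeros(B, N, dtype=torch.bool)
    thrown = torch.zeros(B, N, dtype=torch.bool)
    rows = torch.arange(B).unsqueeze(1)
    # Illustrative rule: mask the most semantically salient patches so the decoder
    # must reconstruct them, and throw away the least-attended (most redundant)
    # patches to shorten the encoder sequence and speed up pre-training.
    masked[rows, order[:, :n_mask]] = True
    thrown[rows, order[:, N - n_throw:]] = True
    return masked, thrown

# Usage with random weights standing in for a real ViT attention map over 14x14 patches.
attn = torch.rand(2, 196).softmax(dim=1)
masked, thrown = attention_guided_mask(attn)
print(masked.sum(dim=1), thrown.sum(dim=1))
```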
Text-based speech editing allows users to edit speech by intuitively cutting, copying, and pasting text, speeding up the editing process. In previous work, CampNet (context-aware mask prediction network) was proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper targets a new task: adding emotional effects to the edited speech during text-based speech editing to make the generated speech more expressive. To achieve this, we propose Emo-CampNet (emotion CampNet), which provides the option of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech. Firstly, we propose an end-to-end emotion-selectable text-based speech editing model. The key idea of the model is to control the emotion of the generated speech by introducing additional emotion attributes based on the context-aware mask prediction network. Secondly, to prevent the emotion of the generated speech from being interfered with by the emotional components in the original speech, a neutral content generator is proposed to remove the emotion from the original speech; it is optimized within a generative adversarial framework. Thirdly, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set, which enables the model to edit unseen speakers' speech. The experimental results show that 1) Emo-CampNet can effectively control the emotion of the generated speech in the process of text-based speech editing and can edit unseen speakers' speech, and 2) detailed ablation experiments further prove the effectiveness of the emotional selectivity and data augmentation methods. The demo page is available at https://hairuo55.github.io/Emo-CampNet/
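A minimal, hypothetical sketch of the conditioning idea only: injecting a learned emotion embedding into a context-aware mask-prediction decoder so the regenerated (masked) region can carry a chosen emotion. Dimensions, module layout, and names are assumptions, not the actual Emo-CampNet architecture.

```python
import torch
import torch.nn as nn

class EmotionConditionedMaskPredictor(nn.Module):
    """Toy stand-in for an emotion-selectable mask-prediction network."""
    def __init__(self, n_emotions=5, d_model=256, n_mels=80):
        super().__init__()
        self.emo_emb = nn.Embedding(n_emotions, d_model)   # emotion attribute
        self.in_proj = nn.Linear(n_mels, d_model)          # acoustic context frames
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(d_model, n_mels)         # predict the masked frames

    def forward(self, mel, emotion_id):
        # mel: (B, T, n_mels) with the edited region zeroed out (masked)
        h = self.in_proj(mel) + self.emo_emb(emotion_id).unsqueeze(1)  # broadcast over time
        return self.out_proj(self.encoder(h))

model = EmotionConditionedMaskPredictor()
mel = torch.randn(2, 120, 80)
out = model(mel, torch.tensor([1, 3]))  # two hypothetical emotion ids
print(out.shape)  # torch.Size([2, 120, 80])
```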
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Instead of encoding the entire dynamic scene within the weights of an MLP, we present a new approach that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system retains the advantages of prior methods in its ability to model complex scenes and view-dependent effects, but also enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings. Our project webpage is at dynibar.github.io.
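A simplified sketch of the core idea of scene-motion-aware image-based rendering: a 3D sample point is displaced by an estimated motion field before being projected into a nearby source view, so features are gathered from where the point actually was at that frame. The projection conventions and shapes here are simplified assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sample_source_features(points, motion, feat, K, w2c):
    """Gather source-view features for 3D points, compensating for scene motion.

    points : (N, 3) sample points in world space at the target time
    motion : (N, 3) displacement moving each point to its location at the source time
    feat   : (C, H, W) feature map of the source view
    K      : (3, 3) source-view intrinsics
    w2c    : (3, 4) source-view world-to-camera matrix
    """
    C, H, W = feat.shape
    pts_src = points + motion                                             # motion-adjusted points
    pts_h = torch.cat([pts_src, torch.ones_like(pts_src[:, :1])], dim=1)  # (N, 4) homogeneous
    cam = (w2c @ pts_h.T).T                                               # (N, 3) camera coords
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)                           # pixel coordinates

    # Normalise to [-1, 1] for grid_sample and gather bilinear features per point.
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feat[None], grid[None, :, None, :], align_corners=True)
    return sampled[0, :, :, 0].T                                          # (N, C) per-point features

pts = torch.rand(1024, 3)
flow = 0.01 * torch.randn(1024, 3)                                        # stand-in motion field
feats = sample_source_features(pts, flow, torch.randn(32, 64, 64),
                               torch.eye(3), torch.eye(3, 4))
print(feats.shape)  # torch.Size([1024, 32])
```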
Optical computing is an emerging technology for next-generation efficient artificial intelligence (AI) with ultra-high speed and efficiency. Electromagnetic field simulation is critical to the design, optimization, and validation of photonic devices and circuits. However, costly numerical simulation significantly hinders the scalability and turn-around time in the photonic circuit design loop. Recently, physics-informed neural networks have been proposed to predict the optical field solution of a single instance of a partial differential equation (PDE) with predefined parameters. Their complicated PDE formulation and lack of efficient parametrization mechanisms limit their flexibility and generalization in practical simulation scenarios. In this work, for the first time, a physics-agnostic neural operator framework, dubbed NeurOLight, is proposed to learn a family of frequency-domain Maxwell PDEs for ultra-fast parametric photonic device simulation. We balance the efficiency and generalization of NeurOLight via several novel techniques. Specifically, we discretize different devices into a unified domain, represent parametric PDEs with a compact wave prior, and encode the incident light via masked source modeling. We design our model with parameter-efficient cross-shaped neural blocks and adopt superposition-based augmentation for data-efficient learning. With these synergistic approaches, NeurOLight generalizes to a large space of unseen simulation settings, demonstrates two-orders-of-magnitude faster simulation speed than numerical solvers, and outperforms prior neural network models with about 54% lower prediction error and about 44% fewer parameters. Our code is available at https://github.com/jeremiemelo/neurolight.
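A small sketch of the superposition-based augmentation mentioned above, resting on the linearity of the frequency-domain Maxwell equations: for a fixed device, a linear combination of source terms should yield the same linear combination of field solutions, so new (source, field) training pairs can be synthesized from existing ones. The tensor shapes are assumptions for illustration.

```python
import torch

def superposition_augment(sources: torch.Tensor, fields: torch.Tensor, n_new: int):
    """Create new (source, field) pairs by random linear superposition.

    sources, fields : (N, ...) complex tensors of paired source terms and field
                      solutions simulated on the SAME device/domain, so that the
                      linearity of Maxwell's equations makes their linear
                      combinations valid solutions as well.
    """
    N = sources.shape[0]
    # Random complex mixing coefficients over the N existing simulations.
    coeff = torch.randn(n_new, N, dtype=torch.cfloat)
    new_sources = torch.einsum('kn,n...->k...', coeff, sources)
    new_fields = torch.einsum('kn,n...->k...', coeff, fields)
    return new_sources, new_fields

# Toy usage: 8 simulated pairs on a 64x64 grid -> 16 augmented pairs.
src = torch.randn(8, 64, 64, dtype=torch.cfloat)
fld = torch.randn(8, 64, 64, dtype=torch.cfloat)
aug_src, aug_fld = superposition_augment(src, fld, n_new=16)
print(aug_src.shape, aug_fld.shape)
```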
Recently, pioneering work has proposed a large number of acoustic features (log power spectrogram, linear frequency cepstral coefficients, constant-Q cepstral coefficients, etc.) for audio deepfake detection, obtaining good performance and showing that different subbands contribute differently to deepfake detection. However, these studies lack an explanation of the specific information carried by each subband, and such features also discard information such as phase. Inspired by the mechanism of speech synthesis, in which fundamental frequency (F0) information is used to improve the quality of synthesized speech, we observe that the F0 of synthesized speech is still overly smooth and differs significantly from that of real speech. F0 can therefore be expected to serve as important information for distinguishing real from fake speech, yet it cannot be used directly because of its irregular distribution. Instead, the frequency band containing most F0 values is selected as an input feature. Meanwhile, to fully exploit phase and full-band information, we also propose to use the real and imaginary spectrograms as complementary input features and to model disjoint subbands separately. Finally, the results from the F0 band and the real and imaginary spectrograms are fused. Experimental results on the ASVspoof 2019 LA dataset show that our proposed system is highly effective for the audio deepfake detection task, achieving an equal error rate (EER) of 0.43%, which surpasses almost all other systems.
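A rough sketch of the feature-extraction side described above: take the complex STFT, keep the real and imaginary spectrograms as phase-aware inputs, and slice out a low-frequency band that covers the typical F0 range of speech. The band boundary and STFT settings are illustrative assumptions.

```python
import torch

def deepfake_features(wave: torch.Tensor, sr=16000, n_fft=512, f0_band_hz=400.0):
    """Split a waveform into F0-band, real, and imaginary spectrogram features.

    wave: (B, T) mono audio at sample rate `sr`.
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=n_fft // 2,
                      window=window, return_complex=True)   # (B, F, frames)
    real, imag = spec.real, spec.imag                        # phase-aware full-band inputs

    # Keep only the bins below ~400 Hz, a band that covers most speech F0 values.
    hz_per_bin = sr / n_fft
    n_low_bins = int(f0_band_hz / hz_per_bin) + 1
    f0_band = spec.abs()[:, :n_low_bins, :]                  # magnitude of the F0 band
    return f0_band, real, imag

wave = torch.randn(2, 16000)                                 # stand-in for 1 s of audio
f0_band, real, imag = deepfake_features(wave)
print(f0_band.shape, real.shape, imag.shape)
```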
Data lies at the core of modern deep learning. The impressive performance of supervised learning is built upon massive amounts of accurately labeled data. However, in some real-world applications accurate labeling may not be feasible; instead, each data sample is given multiple noisy labels (rather than one accurate label) provided by several annotators. Learning a classifier on such a noisy training dataset is a challenging task. Previous approaches usually assume that all data samples share the same set of parameters related to annotator errors, whereas we show that label-error learning should be both annotator- and data-sample-dependent. Motivated by this observation, we propose a novel learning algorithm. The method shows advantages over several state-of-the-art baselines on MNIST, CIFAR-100, and ImageNet-100. Our code is available at: https://github.com/zhengqigao/learning-from-multiple-annotator-noisy-labels.
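One way to make the label-noise model both annotator- and sample-dependent, sketched below, is to predict a per-annotator confusion matrix from the sample's features and train through it on the noisy labels; this is a generic illustration of that idea, not the paper's exact algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyAnnotatorModel(nn.Module):
    def __init__(self, n_classes=10, n_annotators=3, d_feat=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, d_feat), nn.ReLU())
        self.classifier = nn.Linear(d_feat, n_classes)
        # One confusion-matrix head per annotator, conditioned on the sample features,
        # so the noise model depends on both the annotator and the data sample.
        self.noise_heads = nn.ModuleList(
            nn.Linear(d_feat, n_classes * n_classes) for _ in range(n_annotators))
        self.n_classes = n_classes

    def forward(self, x, annotator_id):
        h = self.backbone(x)
        clean = F.softmax(self.classifier(h), dim=-1)                       # p(y_clean | x)
        T = self.noise_heads[annotator_id](h)
        T = F.softmax(T.view(-1, self.n_classes, self.n_classes), dim=-1)   # row-stochastic transition
        noisy = torch.bmm(clean.unsqueeze(1), T).squeeze(1)                 # p(y_noisy | x, annotator)
        return noisy

model = NoisyAnnotatorModel()
x = torch.randn(4, 1, 28, 28)                       # toy MNIST-sized batch
noisy_labels = torch.randint(0, 10, (4,))           # labels from annotator 0
pred = model(x, annotator_id=0)
loss = F.nll_loss(torch.log(pred + 1e-8), noisy_labels)  # fit the *noisy* observations
loss.backward()
```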
We present a method for learning to generate unbounded flythrough videos of natural scenes starting from a single view, where this capability is learned from a collection of single photographs, without requiring camera poses or even multiple views per scene. To achieve this, we propose a novel self-supervised view-generation training paradigm in which we sample and render virtual camera trajectories, including cyclic trajectories, allowing our model to learn stable view generation from single-view collections. At test time, despite never having seen a video during training, our approach can take a single image and generate long camera trajectories comprising hundreds of new views with realistic and diverse content. We compare our approach with recent state-of-the-art supervised view-generation methods that require posed multi-view videos and demonstrate superior performance and synthesis quality.
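The cyclic-trajectory idea can be sketched as a cycle-consistency loss: step a (hypothetical) view-synthesis network along a virtual camera path that returns to the start, and require the final render to reproduce the input photo. `ViewSynth` below is a toy stand-in, since the actual generator, renderer, and pose sampling are beyond an abstract-level sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewSynth(nn.Module):
    """Toy stand-in: predicts the next view from the current one plus a 6-DoF pose step."""
    def __init__(self):
        super().__init__()
        self.pose_fc = nn.Linear(6, 16)
        self.net = nn.Conv2d(3 + 16, 3, kernel_size=3, padding=1)

    def forward(self, img, pose_step):
        B, _, H, W = img.shape
        p = self.pose_fc(pose_step)[:, :, None, None].expand(B, 16, H, W)
        return torch.sigmoid(self.net(torch.cat([img, p], dim=1)))

def cycle_loss(model, img, n_steps=4):
    """Render along a sampled virtual camera loop and force a return to the input view."""
    poses = 0.05 * torch.randn(n_steps, img.shape[0], 6)   # sampled virtual trajectory
    cur = img
    for t in range(n_steps):
        cur = model(cur, poses[t])                         # fly outward...
    for t in reversed(range(n_steps)):
        cur = model(cur, -poses[t])                        # ...and back along the inverse path
    return F.l1_loss(cur, img)                             # cycle reconstruction loss

model = ViewSynth()
photo = torch.rand(2, 3, 64, 64)
loss = cycle_loss(model, photo)
loss.backward()
```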
Data mixing (e.g., Mixup, CutMix, ResizeMix) is an essential component for advancing recognition models. In this paper, we focus on studying its effectiveness in the self-supervised setting. Noticing that mixed images sharing the same source images are intrinsically related to each other, we propose SDMP, short for Simple Data Mixing Prior, to capture this straightforward yet essential prior and position mixed images as additional positive pairs to facilitate self-supervised representation learning. Our experiments verify that the proposed SDMP enables data mixing to help a set of self-supervised learning frameworks (e.g., MoCo) achieve better accuracy and out-of-distribution robustness. More notably, our SDMP is the first method that successfully leverages data mixing to improve (rather than hurt) the performance of Vision Transformers in the self-supervised setting. Code is publicly available at https://github.com/oliverrensu/sdmp
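A compact sketch of the prior itself: if an image is mixed from two sources with weight lambda, the mixed view can be treated as a soft positive of both sources, with the contrastive targets weighted by lambda and 1-lambda. The temperature, mixing rule, and loss layout are simplified assumptions relative to the paper.

```python
import torch
import torch.nn.functional as F

def sdmp_style_loss(z_src: torch.Tensor, z_mix: torch.Tensor, perm: torch.Tensor,
                    lam: float, tau: float = 0.2):
    """Treat each mixed image as a weighted positive of its two source images.

    z_src : (B, D) embeddings of the original images
    z_mix : (B, D) embeddings of images mixed as lam * x[i] + (1 - lam) * x[perm[i]]
    perm  : (B,)  permutation giving each sample's second mixing source
    """
    z_src = F.normalize(z_src, dim=1)
    z_mix = F.normalize(z_mix, dim=1)
    logits = z_mix @ z_src.T / tau                     # (B, B) similarity of mixed vs. source views
    B = z_src.shape[0]
    # Soft targets: weight lam on the first source, 1 - lam on the second.
    targets = torch.zeros(B, B)
    targets[torch.arange(B), torch.arange(B)] = lam
    targets[torch.arange(B), perm] += 1.0 - lam
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

B, D = 8, 128
z_src, z_mix = torch.randn(B, D), torch.randn(B, D)   # stand-ins for encoder outputs
perm = torch.randperm(B)
print(sdmp_style_loss(z_src, z_mix, perm, lam=0.7))
```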
Multimodal knowledge distillation (KD) extends traditional knowledge distillation to multimodal learning. A common practice is to adopt a well-performing multimodal network as the teacher in the hope that it will transfer its full knowledge to a unimodal student for improved performance. In this paper, we investigate the efficacy of multimodal KD. We first present two failure cases and demonstrate that KD is not a universal cure in multimodal knowledge transfer. We then introduce the modality Venn diagram to understand modality relationships and the modality focusing hypothesis, which reveals the decisive factor in the efficacy of multimodal KD. Experimental results on six multimodal datasets help justify our hypothesis, diagnose the failure cases, and point to directions for improving distillation performance.
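A small sketch of the common practice being analysed: distilling a multimodal teacher's soft predictions into a unimodal student via a temperature-scaled KL term plus the usual supervised loss. The networks and temperature here are generic placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multimodal_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard KD objective: CE on ground truth + KL to the (multimodal) teacher."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction='batchmean') * T * T
    return alpha * ce + (1 - alpha) * kd

# Toy setup: the teacher sees audio+video features, the student only sees video.
teacher = nn.Linear(256 + 128, 10)   # multimodal teacher head (placeholder)
student = nn.Linear(128, 10)         # unimodal student head (placeholder)
audio, video = torch.randn(4, 256), torch.randn(4, 128)
labels = torch.randint(0, 10, (4,))
with torch.no_grad():
    t_logits = teacher(torch.cat([audio, video], dim=1))
loss = multimodal_kd_loss(student(video), t_logits, labels)
loss.backward()
```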
Accurate layout analysis without subsequent text-line segmentation remains an ongoing challenge, especially for the Kangyur, a kind of historical Tibetan document featuring considerable touching components and mottled backgrounds. Layout analysis, which aims to identify different regions in document images, is indispensable for subsequent procedures such as character recognition. However, only a little research has been carried out on line-level layout analysis, and it fails to handle the Kangyur. To obtain the best results, a fine-grained sub-line-level layout analysis approach is proposed. First, we introduce an accelerated method to build a dynamic and reliable dataset. Second, SOLOv2 is enhanced according to the characteristics of the Kangyur. We then feed the enhanced SOLOv2 with the prepared annotation files during the training phase. Once the network is trained, instances of text lines, sentences, and titles can be segmented and identified during the inference stage. Experimental results show that the proposed method delivers a decent average precision of 72.7% on our dataset. Overall, this preliminary study provides insights into fine-grained sub-line-level layout analysis and validates the SOLOv2-based approach. We also believe that the proposed method can be adopted for documents in other languages with various layouts.
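For the inference stage, a hypothetical sketch using the mmdetection implementation of SOLOv2 is shown below, assuming a model has already been fine-tuned on sub-line Kangyur classes (text line / sentence / title); the config, checkpoint, and image paths are placeholders.

```python
from mmdet.apis import init_detector, inference_detector

config = 'configs/solov2/solov2_r50_fpn_1x_coco.py'    # placeholder: SOLOv2 config adapted to Kangyur classes
checkpoint = 'work_dirs/kangyur_solov2/latest.pth'     # placeholder: fine-tuned weights
model = init_detector(config, checkpoint, device='cuda:0')

img = 'kangyur_page.jpg'                               # placeholder document image
result = inference_detector(model, img)                # per-class boxes and instance masks
# Save a visualization with the segmented text-line / sentence / title instances drawn on the page.
model.show_result(img, result, score_thr=0.3, out_file='kangyur_page_layout.jpg')
```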